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Abstract 

Cuda and OpenCL are aimed at programmers developing paral- 
lel applications targeting GPUs and embedded micro-processors. 
These systems often have explicitly managed memories exposed 
directly though a notion of disjoint address spaces. OpenCL ad- 
dress spaces are based on a similar concept found in Embedded C. 
A limitation of OpenCL is that a specific pointer must be assigned 
to a particular address space and thus functions, for example, must 
say which pointer arguments point to which address spaces. This 
leads to a loss of composability and moreover can lead to imple- 
menting multiple versions of the same function. This problem is 
compounded in the OpenCL C++ variant where a class' implicit 
this pointer can be applied to multiple address spaces. 

Modern GPUs, such as AMD's Graphics Core Next and Nvidia's 
Fermi, support an additional generic address space that dynami- 
cally determines an address' disjoint address space, submitting the 
correct load/store operation to the particular memory subsystem. 
Generic address spaces allow for dynamic casting between generic 
and non-generic address spaces that is similar to the dynamic sub- 
typing found in objected oriented languages. The advantage of the 
generic address space is it simplifies the programming model but 
sometimes at the cost of decreased performance, both dynamically 
and due to the optimization a compiler can safely perform. 

This paper describes a new type system for inferring Cuda and 
OpenCL style address spaces. We show that the address space 
system can be inferred. We extend this base system with a notion 
of generic address space, including dynamic casting, and show that 
there also exists a static translation to architectures without support 
for generic address spaces but comes at a potential performance 
cost. This performance cost can be reclaimed when an architecture 
directly supports generic address space. 

1. Introduction 

Address spaces play a fundamental role in description of data lo- 
cality in programming languages, allowing the developer to explic- 
itly manage where data lives during program execution. Originally 
developed as a generic extension to the Embedded C [12] vari- 
ant of ANSI C, address spaces have recently gained popularity in 
programming languages for General Purpose Graphics Processing 
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Figure 1. Abstract memory model defined by OpenCL 



Units (GPGPU). In particular, Nvidia's Cuda [21] has support for 
disjoint address spaces as a type modifier, while Khronos' Open 
Compute Language (OpenCL) [22] formalizes them as type quali- 
fiers as a variant of Embedded C, shown diagrammatically in Fig- 
ure 1. A work-item is an instance of kernel at each projection point 
within a 3D iteration space, with access to its own private memory, 
a local memory that is shared between a collection of work-items 
(called a work-group), and finally a globally visible memory shared 
between all concurrently executing work-items. Each address space 
is disjoint and is assumed not to overlap. 

As an example consider the following code that scales a vector 
(A) by a constant(s) outputting a vector (C) : 

kernel void vscale (global int * C, global int * A, 

const global int * S) 

{ 

C[get_global_id(0)] = 

A[get_global_id(0)] * S [get_group_id(0)] ; 

} 
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The implication of OpenCL address spaces are that every pointer 
must be associated with an address space. The drawback with this 
approach, relaxed somewhat in Cuda's model, is the inability to pa- 
rameterize over address spaces, i.e. define parametric polymorphic 



1 The function get_global_id(size_t) returns the projection within the 
3D iteration space that the kernel is being executed over, the argument (0,1, 
or 2) selects the dimension. 
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Figure 2. Generic address space 



functions that are implicitly parametrized by address spaces. For 
example, consider a simple function to scale a value: 

int scale (global int * A, global int * S) ; 

It is easy to adapt the kernel vscale above to call scale. However, 
a typical optimization would be to move S into optimized on chip 
memory accessed via the read only memory segment constant: 

kernel void vscale (global int * C, global int * A, 
constant int * S) 

{ 

C[get_global_id(0)] = scale (&A [get_global_id(0)] , 

&S [get_group_id(0)] ) ; 

> 

Of course, this is will no longer type check as the type of S no 
longer matches scales second argument. To address this limitation 
we introduce an additional address space generic. (Note we assume 
generic to be the default address space and as such it can be elided 
in practice.) scale would be prototyped as: 

int scale (generic int * A, generic int * B) ; 

The generic address space defines a single address space that sub- 
sumes all others, as depicted in Figure 2. The particular placement 
of the different memory spaces within generic is an implementa- 
tion detail that is independent of our definition and use of generic, 
i.e. we could have equally have placed global at the far right of the 
address map and seen no semantic difference. 

In many cases it is straightforward to infer a specific address 
space instance in place of a generic one, using a modified version 
of the Hidley-Milner type inference algorithm [4, 10, 18]. However, 
in general it is not possible because a value within a generic address 
space may take on the type of multiple address spaces over its 
life time. For example, consider the following code that assigns 
a pointer in the global address space to each even work-item and a 
pointer in local address space to each odd work-item: 

void f oo (int *) ; 

kernel void bar( 

global int *g, local int *1) 

{ 

generic int * tmp; 

if (get_global_id(0) '/. 2) { 
tmp = g; 

} 

else { 

tmp = 1; 

} 

f oo (tmp) ; 

> 

In this case to preserve the single code base for all work-items the 
type of tmp needs to allow for the alternative address spaces global 
and local. Such alternatives are common in type systems and are 
known as variant or sum types, i.e. an alternative type of tmp is: 

global + local int *; 



with one alternative indicating that a value is in the global address 
space and the other in the local address space. 

Unfortunately, while the introduction of variant address spaces 
goes part of the way to providing types for generic address spaces, 
it requires that a particular address space component be uniquely 
determined at compile-time. The implication is that such a system 
does not support both variant address spaces and parametric poly- 
morphism over address spaces, i.e. it would fail to provide a prin- 
cipal type for foo in the above example. What is needed is a way of 
combining generic address spaces with variant address spaces. The 
type system proposed in this paper is just such a system. 

Before describing our system in detail we first consider the 
runtime implementation of such a system. In particular, due to 
the disjoint nature of address spaces, an implementation is free to 
support different load/store operations for different address spaces. 
An implication of such hardware targets is a compiler must be able 
to determine, at compile-time, a unique address space. However, 
not all targets have this limitation and in fact Nvidia's Fermi [20] 
and AMD's Graphics Core Next (GCN) [17] both support a notion 
of generic address space, providing a mapping conceptually similar 
to that of Figure 2. 

Even with hardware support for generic address spaces it may 
be beneficial to emit specialized (i.e. unique address space) loads 
and stores for performance or power considerations. For example, 
both Fermi and GCN provide specialized load and store operations. 

1.1 This paper 

The type system described in this paper combines the notion of 
generic address spaces with variant address spaces to provide 
a practical type system for languages such as Cuda [21] and 
OpenCL [22]. In particular, it supports polymorphic (i.e. generic) 
address spaces, extensibility (the ability to add or remove address 
spaces from a type), type inference, down casting to and from 
generic address spaces, and compilation. The type system is an 
application of qualified types, extended to deal with a general con- 
cept of polymorphic address spaces. Positive information about 
which address spaces are expected is captured in a given address 
space variable, called a row, using row extension, while negative 
information is reflected by the use of predicates. 

The most obvious benefit of this is that we can adapt results 
and properties from the general framework of qualified types — 
such as the type inference algorithm and the compilation method — 
without having to go back to first principles. The result is a consid- 
erable simplification of both the overall presentation and of specific 
proofs. 

One important aspect of our system is that while formally de- 
fined it has very practical implications. In particular, we contacted 
and worked with AMD's OpenCL team to implement a modified 
version of our algorithm in their implementation of OpenCL C++, 
a C++ language extension for programming OpenCL devices. We 
expect that our algorithm will form the foundation for specifying 
generic address spaces within OpenCL C too and could be used in 
a Cuda compiler flow to generate more refined address space access 
information. 

This paper makes a number of novel and important contribu- 
tions: 

• We develop a theory of address spaces that is applicable to 
Cuda, OpenCL and other Embedded C variants. Including, for- 
malizing address space type inference and support for generic 
(polymorphic) address spaces. 

• We show that our system can be applied to the dynamic 
generic address space of architectures such as AMD's GCN 
and Nvidia's Fermi. 
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Figure 3. The MiniAS concrete syntax. We assume non-terminals 
x for identifiers, c for constants, and r for row variables. The non- 
terminal op ranges over binary operations, e.g. +, — , etc. 



• We show that the theory of qualified types has application 
outside of its traditional application area, that of functional 
programming languages such as Haskell [16] and Habit [19]. 

• We have implemented a prototype compiler, in the functional 
language Haskell [16], that can be used by other compiler 
writers to develop practical implementations of their own. As 
noted above AMD's OpenCL compiler team have already done 
just this in practice, and is the foundation of their OpenCL C++ 
implementation found in the product APP SDK 2.7. 

A note to the reader; this paper describes a formal foundation 
for address space used in languages such as Cuda and OpenCL, 
including describing a system for inferring address space usage 
when emitted by the programmer. This paper does not describe par- 
ticular syntactic constructs for extending languages like Cuda and 
OpenCL with generic address spaces, however, a companion pa- 
per, also at this conference, describes an application of this model 
to OpenCL C++1 [8]. This companion paper also includes an eval- 
uation of OpenCL C++'s key features, including generic address 
spaces. It is the intention that these two papers be read together. 

The remaining sections of this paper are as follows: Section 2 
provides a general overview of our new type system, with a more 
detailed formal presentation in Section 3. This is followed by dis- 
cussion of type inference in Section 4 and compilation in Section 5. 
Section 6 discusses how the base system can be extended to sup- 
port casting to and from generic address spaces. Section 7 discusses 
related and previous work. Finally, Section 8 concludes with a dis- 
cussion of future work. 

2. Overview 

Since Cuda and OpenCL C are extensions to C++ and C, respec- 
tively, both of which are too complex for a concise formal defini- 
tion, we concentrate here on a subset of these languages that reflects 
essential aspects of the extensions. 

With out loss of generality we further choose to concentrate on 
the extended C subset of C as this simplifies the type system con- 
siderably. We note that it is straightforward to extend out system 
with class subtyping, handled with bounded polymorphism [2, 3], 



and handle recursive template subtyping with bounded polymor- 
phism [2, 3]. We call this language MiniAS. 

The abstract syntax for MiniAS programs is given in Figure 3. 

2.1 Basic operations 

Address space types are defined in terms of rows, and these are 
constructed by extension, starting from the empty row, {}. It is 
convenient to use the following abbreviations for rows: 

{oi, a n | r} = {ai | ...{a n j r}...} 
{ai,...,a„} = {ai j ...{a„ j {}}...} 

Intuitively, an address space of type ASpace {a | r} r * is a 
variant (or union) whose component a implies that the resulting 
pointer type r * can be in that address space, and whose component 
Aspace r ranges over some still to be determined address spaces. 
The basic operations for address spaces are: 

• Definition (injection): to define a pointer, with initializer, with 
an address space: 

r * x = _ :: (r\a) => r — > ASpace {a \ r} r * 
a t * x = _ :: (r\a) => r — > ASpace {a \ r} r * 

• Assignment (injection): to perform an assignment of a pointer 
in the same address space 

_ = _ :: (r\a) => ASpace {a \ r} r * 

— > ASpace {a | r} r * 
— > ASpace {a \ r} t * 

• Assignment (Embedding): to perform an assignment of a 
pointer with a new address space: 

= _ :: (r\a) => ASpacerr* 

— > ASpace {a \ r} r * 
— > ASpace {a \ r} r * 

• De-reference: to perform a de-reference from pointer address 2 : 

ld(_) :: ASpace{a] r * — > r 

Ida(-) :: {r\a) => (ASpace{a} r * — > r) 

— > ASpace {a | r} r — > r 

ld(-, -) :: (ASpace{global} t * — > r) 

— > (ASpace{local} t * — > r) 
— > (ASpace{private} r * — > r) 
— > ASpace r r * 
r 

The empty address space, {}, is the only value of type ASpace{} . 
Predicates are useful for the formal system described below but are 
not required to be written in MiniAS programs. When is clear from 
context that an address space is monomorphic, i.e. a unique address 
space, we will write: 

global int * x = . . . ; 
local int * y = . . . ; 
*x = *y; 

when formally we would have written: 

ASpace { global } int * x = . . . ; 
ASpace { local } int * y = . . . ; 
st_global (x , ld_local (y) ) ; 

In the case of polymorphic address spaces (i.e. row variables) 
we will just elide the address space altogether, e.g.: 



2 We elide the store case as it is denned in a similar fashion and adds little. 



ASpace r int * var = 0; 

will be written as 

int * var; 

dropping the zero initializer too. 

The interesting case is de-reference, when the particular address 
space is not known at compile time and thus must work for any 
valid address space. For example, consider then following: 

kernel void x( 
global * int g, 
local * int 1, 
int value) 

{ 

int * var = 0 ; 
if (value 7. 2) { 
var = g; 

> 

else { 

var = 1; 

> 

*g = *var; 

> 

In general, is not possible to know the address space for the de- 
reference, *var, and so the dereference operation must be able to 
perform a load from any address space. Filling in all the annotations 
the example would be written as: 

kernel void x( 

ASpace { global } * int g, 
ASpace { local } * int 1, 
int value) 

{ 

ASpace r int * var = 0; 
if (value 7, 2) { 
var = g; 

> 

else { 

var = 1; 

} 

store_global (g , 

ld(var, ld_global, ld_local, ld_private) ; 

> 

The example provides implementations, i.e. ld_aspace, for all pos- 
sible address spaces and thus is total, i.e. will not cause an unex- 
pected load from address space error. 

2.2 Implementation details 

The implementation of address spaces must select, in the fully 
generic case at runtime, the load or store instruction that matches an 
individual address space. To select a particular load loadjispace 
from an address space a, we need to know the address space ID 
representing the value a. Each address space is assigned an integer 
ID defined as follows 3 : 

global = 0 
local = 1 
private = 2 

MiniAS programs without generic address load/stores, i.e. standard 
OpenCL 1.2 programs, only contain load/stores whose address 



space is known, and hence the full type of a, is known at compile- 
time. 

In the more general case of generic address spaces it is not nec- 
essary to know the address space (ID) for every load and store at 
compile-time; instead, we treat unknown offsets as implicit param- 
eters whose values will be supplied at run-time when the full types 
of the load/stores concerned are known. Intuitively, load/stores are 
implemented as a jump table, where the ID provides the index into 
the jump table, selecting a specific load/store. This is essentially 
the compilation method of Gaster and Jones [9]. If for a moment 
we forget about typing issues, then the load(_, _, _, _, _) could be 
implemented as: 

load(idx, gld, lid, pld, addr) 
{ 

switch idx { 
case 0: 

return gld (addr) ; 
case 1 : 

return lid (addr) ; 
case 2 : 

return pld (addr) ; 

} 

} 

Of course, there are run-time overheads in passing offset values as 
extra parameters. However, an attractive feature of our system is 
these costs are only incurred when the extra flexibility of generic 
address spaces is required. Moreover, an architecture that supports 
generic address spaces directly, e.g. Nvidia's Fermi, can simply 
elide the additional parameters and jump-tables, issuing a single 
load or store instruction. Each predicate r\a in the type of a func- 
tion signals the need for an extra run-time parameter to specify the 
address-space used to determine the particular load/store. This one 
single extra piece of information is all that is needed to implement 
the full set of address space operations. 

The type checker gathers and simplifies the predicates generated 
by each use of an operator on address spaces. For example, the 
derived type, for the load of x, in the following: 

global int * x = . . . ; 
... *x . . . ; 

will generate a single constraint, {}\global. Predicates, like this, 
involving rows whose structure is known at compile-time, are eas- 
ily discharged by calculating the appropriate address space ID. Ob- 
viously, a compiler can use this information to produce efficient 
code by inlining and specializing to emit a specific load instruc- 
tion, for the corresponding address space. It is possible to show 
that for all MiniAS programs that use only explicit address spaces, 
i.e. OpenCL 1.2 programs, then all predicates can be statically de- 
termined and thus discharged at compile time. We return to this and 
additional static properties during the formalization of the model. 

Predicates that are not discharged within a section of code will, 
instead, be reflected in the type assigned to it. For an implemen- 
tation not supporting generic address spaces in hardware it is pos- 
sible to easily define rules that restrict generic address spaces for 
function arguments and variable definitions to be defined only for a 
single address-space instance. For example, a compiler might reject 
types which contain multiple predicates for the same row variable, 
but allow functions whose arguments are generic address spaces. 
This would allow: 

kernel void foo(int * x) 
{ 



3 This mapping is just one possible mapping. 



global int * g = . . . ; 
local int * 1 = . . . ; 

foo(g) ; 
foo(l) ; 

while disallowing: 

kernel void bar (global int * g, local int * 1) 
{ 

int * a; 

if (...) { 

a = g; 
> else { 

a = 1; 

} 



A consequence of these restrictions is that it is straightforward to 
define a translation, based on the notion of simplification [13], 
to specialize, at compile time, all predicated function calls, to 
a MiniAS program that is guaranteed to contain no discharged 
predicates 4 . 

3. Formal presentation 

MiniAS's type system is based on Jones' theory of qualified 
types [13] adapting the notion of subtyping for records and variants 
developed by Gaster and Jones [9]. 



3.1 Kinds 

k ::= * 
| row 
| aspace 
| fei — »- fea 
Intuitively the kind ki 
take something of kind ki 



the kind of all types 
the kind of rows 
the kind of address spaces 
function kinds 

— > ki represents the constructors that 
and return something of kind ki. The 



row kind is from Gaster and Jones' system [9]. The aspace kind 
represents the kind of address spaces and is new to the system 
presented in this paper. 

3.2 Types and constructors 

For each kind k, we have a collection of construtors C k (including 
variables a k of kind k: 



C k ::= X k 

\a k 

-ik' 



k' 



constants 
variables 
applications 
types 



The usual collection of types, represented here by the symbol 
t, is just the constructors of kind *. For the purposes of this paper, 
we assume that the set of constant constructors includes at least the 
following, writing X :: k to indicate the kind k associated with each 
constant X : 



{} 

{-I-} 
ASpace 



row 

aspace — > row — > row 
row —>*—>* 



function space 
empty row 

extension, for each address space 
address space construction 



The result of applying the function space constructor — > to two 
types t and r' is the type of functions from r to r', and is 
written as r — > r' in more conventional notation. Technically 



MiniAS functions are uncurried (i.e. of the form (tl, t„) — > 
r) but for ease of description we will use curryed notation (i.e. 

Tl — > ... — > T„ — > T). 

• The result of applying the ASpace constant to the empty row 
{} of kind row and some type r * is the type Aspace {} r * of 
kind *. 

• The result of applying an extension constructor {_ | _} to a type 
t and a row r is a row, usually written as {a | r}, obtained by 
extending r with a an address space a. Note that we include an 
extension constructor for each different address space a. 

The kind system is used to ensure that type expressions are well- 
formed. While it is sometimes convenient to annotate individual 
constructors with their kinds, there is no need in practice for a pro- 
grammer to supply these annotations. Instead, they can be calcu- 
lated automatically using a simple kind inference process [14]. 

We consider two rows to be equivalent if they include the same 
address spaces, regardless of the order in which they are listed. This 
is described formally by the equation: 

{a, a j r} = {a',a\ r} 

For the purposes of later sections, we define a membership 
relation, a e r, to describe when a particular address space a 
appears in a row r: 



a£{a | r} a € T [ a jL a >] 

a € {a' | r} 

and a restriction operation, r — a , that returns the row obtained 
from r by deleting the address space a: 

{a | r} — a = r 

{a | r} — a = {a \ r — a} 

It is easy to prove that these operations are well-defined with 
respect to the equality on constructors, and to confirm intuitions 
about their interpretation by showing that, if a G r, then r = 

{a | r — a}. 

3.3 Predicates 

The syntax for rows allows examples like {a, a} where the ad- 
dress space a appears in more than one field. Clearly, we do not 
want an address space to appear twice and some additional mech- 
anisms are needed to enable us to specify that a type of the form 
ASpace{a | r }, for example, is only valid if the row r does not also 
contain a. We achieve this using the lacks predicate of Gaster and 
Jones' [9]: 

7r :: C row \a predicates 

Intuitively, the predicate r\a can be read as an assertion that 
the row r does not contain the address space a. More precisely, 
we explain the meaning of predicates using the entailment relation 
defined in Figure 4. A derivation of P \— n from these rules can be 
understood as a proof that, if all of the predicates in the set P hold, 
then so does it. It is easy to prove that the relation )= is well-defined 
with respect to equality of constructors. 

fUWh P\=r\a a + a! P ^ {} \ a 

P \= K I r}\a 

Figure 4. Predicate entailment for rows. 



*This restriction is effectively Haskell's monomorphism restriction [16]. 
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PjT h e : 7T => p 



P|rhe:Va.cr PIT h e : a PIT h e : a £ TVCT) U TV(P) 

[VP] — — — [V/] 



P|r h e : [ T /a]a 
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[FCN] — [APP] 
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[Program] 



Figure 5. MiniAS Typing Rules 



3.4 Typing rules 

Following Damas and Milner [4], we distinguish between the sim- 
ple types t, described above, and type schemes, er, described by the 
grammar below: 

g ::= p I Va.p type schemes 
p ::= t I -K => t qualified types 

For simplicity of presentation and due to the fact that MiniAS 
does not support general polymorphic types, we restrict our presen- 
tation to type schemes with type variables of kind row. 

Restrictions on the instantiation of universal quantifiers, and 
hence on polymorphism, are described by encoding the required 
constraints as a set of predicates, P, in a qualified type of the form 
P => r. The set of free type variables in a object X is written as 
TV(X). 

The syntax for our term language is that of MiniAS, defined 
in Figure 3. Each of the address-space load and store operations 
is assigned a closed type scheme, a w / st . The typing rules are 
presented in Figure 5. 

4. Type inference 

This section provides a formal presentation of a type-inference 
algorithm for inferring address space usage. The most important 
feature is our adaptation of Gaster and Jones' [9] inserters for 
address spaces, to account for non-trivial equalities between row 
expressions during unification. 

4.1 Unification and insertion 

Unification is a standard tool in type inference, and is used, for 
example to ensure that the formal and actual address space param- 
eters of a function have the same type. Formally, a substitution is 
S is a unifier of constructors C, C G C k if SC = SC , and is 
a most general unifier of C and C' if every unifier of these two 



constructors can be written in the form RS, for some substitution 
R. 



(id) 

(bindL) 
(bindR) 



(apply) 



(row) 



[C/a] „ 

a ~ G 

n lc/a] „ 
C ~ a 

C~C UP ~ UP' 

cp u y cp' 

a G r' Ir ~ (Ir' — a) 

r , ui , 
{a I r) ~ r 



a i TV(C) 
a i TV(C) 



Figure 6. Kind-perserving unification 



The rules in Figure 6 provide an algorithm for calculating uni- 
fiers, writing C ~ C" for the assertion that U is a unifier of the 
constructors C, C" G C k . The first three rules are standard [25], 
and are even suitable for unifying to row expressions that list ex- 
actly the same components with exactly the same ordering in each. 
But the forth rule, (row), is needed to deal with the more general 
problems of row unification. 

To understand how this rule works, consider the task of unifying 
two rows {a \ r} and {a' \ r'}, where a, a' are distinct address 
spaces, and r, r' are distinct row variables. Our goal then is to find 
a substitution S that: 



{a I Sr} 



= S{a\r} 
= S{a'\r'} 
= {a'\Sr>} 



Clearly, the row on the left includes an a address space, while the 
last row on the right include an a address space. If these two 
types are to be equal, then we must choose the substitution S so 
that it will 'insert' the missing fields into the two rows r' and r, 
respectively. In this particular case, then we can choose: 

S = [{a! | r"}/r, {a j r"}/r'] 

where r" is a new type variable. 

More generally, we will say that a substitution S is an inserter 
of a into r G (j row [f a g Sr. S is a most general inserter of a 
into r if ever such an inserter can be written in the form RS, for 
some substitution R. The rules in Figure 7 define an algorithm for 
calculating inserters of a into r G C row . 

[{a\r'}/r 

(idVar) a G r r new 

k a. ' 
a G r a f= a 

(inTail) a e {a' \ r} 

id 

(inHead) a G {a r} 



Figure 7. Kind-perserving insertion 

The important properties of unification and insertion — both 
soundness and completeness — are captured in the following result: 

THEOREM 4.1. The unification (insertion) algorithm defined by 
the rules in Figure 6 ( Figure 7) calculates most general unifiers 
(inserters) whenever they exist. The algorithm fails precisely when 
no unifier ( inserter) exists. 

The proof is a straightforward variant of that given by Gaster [6], 
describing his system of records and variants. It is important to 
note that the unification algorithm is simplified, when compared to 
Gaster's original work, due to address space components not being 
labeled. 

4.2 A Type inference algorithm 

Given the unification algorithm described in the previous section, 
we can use an extended version of the type inference algorithm 
of qualified types [14] as a type inference algorithm for the type 
system presented in this paper. The definition of the algorithm 
is given in Figure 8. Following Remy [24], these rules can be 
understood as an attribute grammar; in each typing judgement 
P\TT \- w e : r, the type assignment V and the term e are 
inherited attributes, while the predicate assignment P, type r, and 
substitution T are synthesized. The (Program) w rule uses an 
auxiliary function to calculate the generalization of a qualified type 
p with respect to a type assignment V. This is defined as follows: 

Gen(r, p) = Va;.p, where{ai} = TV(p)\TV(T). 

In general the rules are straightforward modifications of the origi- 
nal typing rules given in Figure 5 for type inference. For example, 
the rule (IF) has additional hypothesis that perform unification on 
the inferred types of the condition and two alternatives, and appli- 
cations of the synthesized substitutions, but otherwise is the same 
as the original typing rule. As such the type inference algorithm is 
both sound and complete with respect to the original typing rules. 

THEOREM 4.2. The algorithm described by the rules in Figure 8 
can be used to calculate the principal type for a given declaration d 
under the assumptions Y. The algorithm fails precisely when there 
is no typing for d under F. 



The proof is again straightforward and follows directly from 
the earlier work of Gaster [6] and more generally Jone's system 
of Qualified Types [13]. 

5. Compilation 

In previous sections we described informally how programs involv- 
ing operations on address spaces can be compiled and executed 
using a language that adds extra parameters to supply appropriate 
offsets. This section shows how this process can be formalized, in- 
cluding the calculation of address space IDs. 

5.1 Compilation by translation 

In the general treatment of qualified types [13], programs are com- 
piled by translating them into a language that adds extra parameters 
to supply evidence for predicates appearing in the types of the val- 
ues concerned. The whole process can be described by extending 
the typing rules to use judgements of the form: 

which include both the original source term e and a possible trans- 
lation e . A further change here is the switch from predicate sets to 
predicate assignments; the symbol P used above represents a set of 
pairs (v : it) in which no variable v appears twice. Each variable v 
corresponds to an extra parameter that will be added during com- 
pilation; v can be used whenever evidence for the corresponding 
predicate it is required in e . 

In the current setting, predicates are expressions of the form 
(r\a) whose evidence is the address space ID for the particular 
a. The calculation of evidence is described by the rules in Figure 9, 
which are direct extensions of the earlier rules for predicate en- 

P U {v : ir} |= v : ir 

P \= e : (r\a) 
P\=m: {a' | r}\a m = f e ' 1 a ,< a ' 

p\=0-- (OVO 



Figure 9. Predicate entailment for rows. 

tailment that were given in Figure 4. Intuitively, a derivation of 
P \= e : 7T tells us that we can use e as evidence for the predi- 
cate 7r in any environment where the assumptions in P are valid. 
The second rule is the most interesting and tells us how to find the 
address space ID in a row {a | r}: 

• If a comes before a' in the total ordering, <, on address space 
IDs, then the required ID will be the same as the ID e of a in r. 

• But, if a comes before a, then we need to use an ID of e + 1 to 
account for the address of a'. 

In general, these rules calculate address IDs that are either a 
fixed natural number, or an addition from a natural number and one 
of the variables in P. 

For reasons of space, we omit the complete description of trans- 
lation from this paper, and instead focus on describing the two rules 
that account for the user and introduction of address space ID pa- 
rameters. The first of these is a variation on function application: 

P|rhe^e' :tt^ P P \= e" : it 
P\T h e ~» e e" : p 



P|7T \- w e : r Q\T'TF \- w z : r' 

[Empty] [Ret] 

{}|{} h w ; : void T'P U Q\T'TY h w return e; z : r 



P\TTh w e:r' Q\T'TY \- w z : r 
T'PUQ\T'TT\- W e;z : r 



[EXPR] 



(x : Vaj.P => r) G r ftr 



[VAR] 



P|Tr h w e : r" Tr" ~ r Q|TTr, t> : r h w 2 : r' 



[ft/aJPjr h w a : [/3;/a;]r U(T'P U Q)|£/T'7T h w r « = e; 2 : r' 

P\TTh w e-.T r-bool Q\T'TT \- w z 0 ; z : r' Q'\T"T'TT \- w z\\z: r" Ut' ~ Ut" 
U'U(P U Q U Q')\U'UT"T'TV h w if (e) {20} else {zi}; z : U W 



[VDECL] 



[If] 



Plm-^eir r^bool Q\T'TT h 2 : t' Ut' ^ void Q'\T"T'TT \- z' : t' 
U'U(P U Q U Q')\U'UT"T'TT h while (e) {z}; 2' : r" 

PlTr^. , «j : a; h 147 6 : r a; new 
- [FCN] 

P\T'T h w e : {U -+) Q 0 |T o TT h w e 0 : r 0 ... Q„|(T 0 ...T„_i)TT h w e n : r n 



[While] 



, u 0 

to ~ TO 



(E/n-i(...(t/o))*n ~ [/„_i(...({7 0 ))t„ -»■ a anew 



[App] 



U((t/„...[/o)(r 0 ...T n r')P,Q i )!(^n...t/ 0 )(T 0 ...r Il r')rh M/ x( Ci ) : ([/„.. .t/ 0 ) a 

Pi\TiT h w fi{vi){b} : n a i = Gen(T i r,P i ^n) P'\T'(T a ...T n r h ,f l :a l )^ w b':T' t' ^ void 
P'\T'(T 0 ...T n )T h w /i(«i){6j} main(){b'} : f/r' 

Figure 8. Type inference algorithm W. 



[Program] 



This tells us that we need to supply suitable evidence e" in the 
translation of any program whose type is qualified by a predicate 
7r. The second rule is analogous to function abstraction, and allows 
us to move constraints from the predicate assignment P into the 
inferred type 5 : 

PU{v : 7v}\F he-»e':p 

P|T he-> Xv.e' :ir=>p 

These two rules are direct extensions of the (=> E) and => I) in 
Figure 5, and combined with simple extensions of the other rules 
there, we can construct a translation for any term in Mini AS. 



6. Down casting 

As of today the OpenCL 1.x specification does not allow for cast- 
ing between pointers of different address spaces, it seems to make 
little sense when address spaces are disjoint. With the introduction 
of generic address spaces the ability to up cast to generic address 
space is built into the type system by default. However, the ability 
to down cast (i.e. translate from a generic address space to a special- 
ized one) may also useful. Moreover combining down casting with 
the ability to test if a generic address pointer is in a given special- 
ized address space allows one to call specialized library functions. 

5 For simplicity we have "cheated" a little by introducing the use of lambda 
abstraction not defined in MiniAS. However, in practice evidence abstrac- 
tion will only appear at function application and so an implementation 
would just add the additional argument to the associated function and thus 
there is no additional overhead or complexity. 



For example, consider the case when a 3 rd party library contains 
the specialized functions: 

int f oo_local(local * int) ; 
int foo_global (global * int); 

but does not contain a generic version. Using a cast operator sim- 
ilar to C++'s dynamic_cast — that either casts a generic address 
space to the specified address space, if it indeed matches the actual 
address space, or returns NULL — it is straightforward to define a 
generic version of foo: 

int foo (generic * int p) 
{ 

if ((global int * gptr = 

dynamic_cast<global int *>(p)) != NULL) { 
return f oo_global (gptr) ; 

} 

else if ((local int * lptr = 

dynamic_cast<local int *>(p)) != NULL) { 
return f oo_local (lptr) ; 

} else { 

return -1; 

} 



dynamic_cast<_>(_) can be assigned the type: 

Va, r.r\a => ASpace {a\r} * a — > a * a 

with specific versions implemented for each value of a, i.e. global, 
local, and private. For example after translation pseudo code for 
the global address space version might be defined as: 



dynamic_cast_global(idx, ptr) 
{ 

if (idx == 0) { 
return ptr ; 

> else { 

return NULL; 

} 

} 

Of course, a compiler targeting specific hardware that supports 
generic address spaces and instruction set that defines the ability 
to convert two and from specialized address spaces could generate 
code directly to these operations. For example, AMD's Heteroge- 
neous System Architecture (HSA) [1, 26] supports a selection of 
memory segments, most of which are disjoint (similar to OpenCL) 
and aflat address space that subsumes all others. HSA's Input Lan- 
guage (HSAIL) is a low-level device independent ISA supporting 
the following operations to translate to and from generic (flat) ad- 
dress spaces (segments): 

• Test if aflat pointer is in segment: 

segmentp_segment_bl dst , src 

• Convert flat pointer to segment: 

f tos_segment_type dst, src 

• Convert segment pointer to flat: 

stof _segment_type dst, src 

Assuming a set of compiler builtin functions mapping directly 
to HSAIL operations, then dynamic -Cast might be implemented 
directly as: 

dynamic_cast_global(ptr) 
{ 

if (segmentp_global(ptr) ) { 

return ftos_segment_i32(ptr) ; 

> else { 

return NULL; 

} 

} 

As HSAIL directly supports a notion of generic address space, then 
as discussed in Section 2 the address space ID argument has been 
elided by the compiler with no additional performance cost, even 
in the presence of dynamic typing. 

7. Related work 

To our knowledge we are the first to propose formalizing OpenCL's 
address spaces and provide a complete type inference system. Of 
course, there has been other approaches to abstracting user man- 
aged memories and also in the area of type inference for type qual- 
ifiers. In this section we discuss work most relevant to ours. 

7.1 A theory of qualifiers 

We are not the first to study type inference for type qualifiers for 
C and the most relevant is the work of Foster et al [5]. They de- 
scribe a system that is capable of inferring static type qualifiers, in- 
cluding the ability to have polymorphic functions parameterized by 
type qualifiers. However, their system is limited to type qualifiers 
that can be uniquely determined at compile time. They provide no 
system for dynamically selecting between different types of quali- 
fiers as is necessary in the general case and in particular for address 
space load and store operations. 



Forster et al system has the goal to enforce and discover static 
invariants that can help the compiler produce more efficient code 
and rule out unintended program errors. Their system does pro- 
vide the ability to refine a types qualification, i.e. add const to a 
pointer, but lacks the capability for the same value to have multiple 
alternatives for a particular qualifier. In fact their sub-type relation 
is closely related to Gaster and Jones' [9] extensible record type 
which introduces a similar lattice type structure. 

Additionally Foster et al's system is based on a complicated sys- 
tem of sub-typing constraints which can often lead to complicated 
types which are difficult for developer to write down in practice. 
Our system, on the other hand, has simple types which can be eas- 
ily written down by the developer, although it is not required. 

One area of future work would be to incorporate a variant of 
Gaster and Jones' extensible records into our system to handle the 
qualifiers of Foster et al. In particular, as such as system would 
support our compilation scheme, then they might be interesting 
qualifiers not expressible in Foster et al's system. 

7.2 OpenCL C++ 

OpenCL C++ [8] supports OpenCL C address spaces but comes 
with the additional complication of a classes this. The this pointer 
complicates matters as it is often left implicit but effects how 
particular member functions will behave. For example, consider a 
classes copy constructor which may be implicitly generated by the 
compiler from a particular use case. In general, the copy constructor 
is of the form: 

Foo(Foo & rhs) ; 

But for OpenCL C++, Foo& must live in an address space, which if 
implicitly defined, like the copy constructor itself, must be inferred 
at compile time. For this OpenCL C++ uses a subset of the type 
inference algorithm formalized in this paper, whose address spaces 
are all uniquely inferred at compile time. In developing OpenCL 
C++ AMD initially implemented an ad-hoc type inference algo- 
rithm for address spaces, which time and time again caused pain in 
the discovery of corner cases that had not been originally consid- 
ered. This was only conflated with the move to C++1 1 [11], which 
has its own inference rules for auto and decltype. Motivated by 
these short comings in the original implementation of OpenCL C++ 
and a desire to add generic address spaces to OpenCL C we were 
motivated to develop the system described in this paper. 

7.3 Heterogeneous Parallel Pattern 

The GPGPU programming model Heterogeneous Parallel Patterns 
(HPP) [7] is a braided parallel model, supporting both task and 
data-parallelism, that is embedded into C++1 1. Like OpenCL C++ 
it supports a device programming language that is designed to 
target GPUs and other accelerator like devices. However, unlike 
OpenCL C++ it does not expose explicitly managed address spaces 
and instead it supports a PGAS style globally shared memory 
model combined with a hierarchical array abstraction, called Dis- 
tArray. Like address spaces DistArray is intended to allow devel- 
opers to explicitly manage data locality, however, this is achieved 
via describing a hierarchical nesting of regions bound on use. 

A key difference when compared to our system is that data 
placement for HPP's DistArray is dynamically determined by the 
runtime, while our system is based on static type inference. Of 
course, in the case that an address space cannot be uniquely deter- 
mined at compile time our system still requires a runtime parameter 
to determine which memory to read and write from. In developing 
our system we were constrained to design a system that fitted with 
an existing programming language (OpenCL), and did not have the 
freedom to design a language without address spaces. 



8. Conclusion 

We have described a flexible type system for generic (polymorphic) 
address spaces with effective type inference algorithm and compi- 
lation method. A prototype implementation has been written as a 
standard alone compilation flow allowing developers an researchers 
alike to study the algorithm in isolation from a more complicated 
system, such as OpenCL. An implementation of our algorithm has 
also been implemented as part of AMD's APP SDK, which sup- 
ports a variant of OpenCL C++. Our experience to date shows that 
these implementations work well in practice. Furthermore, generic 
address spaces have been proposed as a future feature of standard 
OpenCL C and the algorithm proposed in this paper can form a 
foundation for this development [23]. 

More generally our system can form the basis of an address 
space system for an variant of Embedded C and to our knowledge 
we are the first to propose such as system. Such a system could be 
used as the foundation to extend Embedded C to C++, similar to 
OpenCL C++'s extension to OpenCL C, and could prove useful 
for future embedded system development. Without similar type 
inference system for address spaces it is hard to see how Embedded 
C++ could be useful in practice. 

The above extensions to OpenCL are, at present, restricted to 
compile time specialization, i.e. all predicates must be discharged 
statically otherwise it's a compile time error. Our system is more 
general and provides the ability to support a dynamic generic ad- 
dress space, via the introduction of address space IDs that are 
passed around at runtime and used to dynamically select the spe- 
cific address space operation. While this could have a potential run- 
time impact we highlight this is only observed when dynamic fea- 
tures are used by the developer. Moreover, we demonstrated that if 
the underlying system directly supports generic address spaces, as 
per AMD's HSAIL or Nvidia's Fermi, then the additional address 
space IDs can be elided, without any additional runtime overhead. 

One interesting area of future work is to consider providing a 
formal type system for something like HSAIL, including handling 
of its segments and flat address space. The system proposed in this 
paper may in turn be an interesting place to begin that work. For 
one it describes the semantics of address spaces (i.e. segments) 
but moreover Jones has shown that more generally the theory of 
qualified types can be used to statically type and verify intermediate 
languages such as Java's bytecode [15]. 
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